A companion document for the R/Medicine 2021 Virtual Conference’s Introduction to R for Medical Data workshop.
Welcome to R/Medicine 2021. In this workshop we will be doing a hands-on introduction to R for medical data.
Learn how your organization could use R
Write your first R code!
See some highlights of the R ecosystem
You’ll need the following during the workshop:
A computer with internet connection
An RStudio Cloud account—free
Please note while you can use a local version of R and Rstudio for the workshop, we cannot provide support for you during the workshop to address issues that come up your local device and software. We use RStudio Cloud because we can do all the technical preparation for participants ahead of time and spend our limited workshop time on learning R as fast as possible.
If you would like to some additional preparation for the workshop, I recommend learning some about Markdown in this ten minute tutorial.
Introduction ~xxx minutes
Systems check ~xxx minutes
Learn and Do ~xxx minutes
Wrap Up and Discussion ~xxx minutes
For this workshop we will be using RStudio Cloud.
Why? Because there is nothing that you need to download!
You don’t need to use RStudio to use R, but I do!
RStudio on your computer and RStudio Cloud look very similar.
R and RStudio are free to download to your computer.
Log into RStudio Cloud
Raise hand in zoom if you are having trouble
Once you are in RStudio Cloud, click on the project called “Intro to R for Medical Data.”
Let’s open the file in the file pane (number 4 in the image above) called Intro_to_R_for_medical_data_workshop.Rmd. This is what we will be working for the workshop.
Your document will look something like this.
The YAML header contains the special instructions on how to create the output document. We won’t do much with it here today, but it is a very powerful way to make your Rmd file as bespoke as you want it!
Code chunks are where the code will go.
Code chunks have a gray background.
#This is a code chunk!
#Here is a simple calculation
1 + 2
[1] 3
You can run a code chunk by pressing the green play button.
No problem! R can ingest data from lots of locations including from Excel spreadsheets that you might already be very familiar with.
XXX smoke data from XXX with data dictionary here XXX put in link.
Run the code chunk below by presening that green play button.One thing that Excel does well is to provide an interactive visual representation of the data. This allows you to inspect it by sorting and filtering. RStudio actually does this well, too, with one difference - it won’t let you change any of the data while you inspect it.
Look on the right at the Environment pane (you might have to click icon that looks like a spreadsheet on the “Environment” tab) and find the entry smoke_complete. This is the data frame you just created inside of R’s memory. (If you don’t see smoke_complete, try running the code chunk above again).
Within the Environment pane, click on the smoke_complete to view the data.
Go ahead and try to edit one of the values in this viewer. You will find that you can’t. It would have been easy for the RStudio programmers to allow editing of specific values, but they decided not to add that feature.
Why do you think this was designed that way?
Next we will use the glimpse() function to learn about our data.
How many rows of data do we have in our data?
How many columns are there? What are some of the column names?
Can you tell what kind of variable is stored in the columns?
Rows: 1,152
Columns: 20
$ primary_diagnosis <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ tumor_stage <chr> "stage ia", "stage ib", "stage i~
$ age_at_diagnosis <dbl> 24477, 26615, 28171, 27154, 2337~
$ vital_status <chr> "dead", "dead", "dead", "alive",~
$ morphology <chr> "8070/3", "8070/3", "8070/3", "8~
$ days_to_death <chr> "371", "136", "2304", "NA", "NA"~
$ state <chr> "live", "live", "live", "live", ~
$ tissue_or_organ_of_origin <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_birth <dbl> -24477, -26615, -28171, -27154, ~
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.~
$ days_to_last_follow_up <chr> "NA", "NA", "2099", "3747", "357~
$ cigarettes_per_day <dbl> 10.9589041, 2.1917808, 1.6438356~
$ years_smoked <chr> "NA", "NA", "NA", "NA", "NA", "N~
$ gender <chr> "male", "male", "female", "male"~
$ year_of_birth <chr> "1936", "1931", "1927", "1930", ~
$ race <chr> "white", "asian", "white", "whit~
$ ethnicity <chr> "not hispanic or latino", "not h~
$ year_of_death <chr> "2004", "2003", "NA", "NA", "NA"~
$ bcr_patient_barcode <chr> "TCGA-18-3406", "TCGA-18-3407", ~
$ disease <chr> "LUSC", "LUSC", "LUSC", "LUSC", ~
Next, we will use the skim() function to learn even more about our data.
Press the green play button in the code chunk below.
Now can you tell the different column types we have? What types are in this data?
Look at the mini histograms at the bottom of the output? What can you learn very quickly about your data for the distributioncigarettes_per_year variable
| Name | smoke_complete |
| Number of rows | 1152 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| primary_diagnosis | 0 | 1 | 5 | 6 | 0 | 17 | 0 |
| tumor_stage | 0 | 1 | 7 | 12 | 0 | 11 | 0 |
| vital_status | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
| morphology | 0 | 1 | 6 | 6 | 0 | 12 | 0 |
| days_to_death | 0 | 1 | 1 | 4 | 0 | 279 | 0 |
| state | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| tissue_or_organ_of_origin | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| site_of_resection_or_biopsy | 0 | 1 | 5 | 5 | 0 | 16 | 0 |
| days_to_last_follow_up | 0 | 1 | 1 | 4 | 0 | 459 | 0 |
| years_smoked | 0 | 1 | 1 | 2 | 0 | 48 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| year_of_birth | 0 | 1 | 2 | 4 | 0 | 67 | 0 |
| race | 0 | 1 | 5 | 41 | 0 | 6 | 0 |
| ethnicity | 0 | 1 | 12 | 22 | 0 | 3 | 0 |
| year_of_death | 0 | 1 | 2 | 4 | 0 | 21 | 0 |
| bcr_patient_barcode | 0 | 1 | 12 | 12 | 0 | 734 | 0 |
| disease | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at_diagnosis | 0 | 1 | 24175.38 | 3926.13 | 7855.00 | 22068.75 | 24750.50 | 26927.00 | 32872 | <U+2581><U+2581><U+2583><U+2587><U+2582> |
| days_to_birth | 0 | 1 | -24175.38 | 3926.13 | -32872.00 | -26927.00 | -24750.50 | -22068.75 | -7855 | <U+2582><U+2587><U+2583><U+2581><U+2581> |
| cigarettes_per_day | 0 | 1 | 2.61 | 2.04 | 0.01 | 1.37 | 2.19 | 3.29 | 40 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Now we will take the data we have loaded and make some plots!
Produce a histogram of smoke_complete using geom_histogram(), mapping these variables to the following aesthetics:cigarettes_per_day to the x aesthetic.
plot_1 <- ggplot(data = smoke_complete,
aes(x = cigarettes_per_day)) +
geom_histogram()
plot_1
We’ve made a plot, but there are some issues with it:
Let’s make our plot a bit prettier.
theme_classic() to the plotYou got a message after running that code that: “stat_bin() using bins = 30. Pick better value with binwidth.”
Let’s modify the width of our histogram’s bins in the code chunk below to 2. Then try 10.
Finally let’s explore the differences in daily cigarettes for the gender variable.
gender to the facet
ggplot2: A Grammar of Graphicsggplot2 is an extremely powerful software library for visualization.
The gg is short for Grammar of Graphics, which means that visualizations are expressed in a specific and consistent way for different types of visualizations in the package. There is a wonderful universe of complementary packages in R that extend ggplot to many different types of visualizations.
Here’s a visual summary of the different parts we’re talking about today. There are many parts to visualizations, but many of us don’t have the words to describe the different types of parts of a graph. Here we will spend some time breaking down the meeting of different constituent parts.
Note: there are different ways to write ggplot code to get the same output, sort of formal versus causal conversation. Here we will focus on the more formal style because it is the most explicit, but also more verbose! As you get familiar with ggplot, you will likely shorten your code a bit at times from this very explicit method. We won’t cover every layer in depth, but know that breaking down visualizations to these constituents allows for tremendous control.
Image from Thomas Lin Pedersen
ggplot2 codeA ggplot2 graphic consists of a mapping of variables in data to the aes()thetic attributes of geom_etric objects.
In code, this is translated as:
# start the plot object with ggplot()
# add assign smoke_complete to the data argument
ggplot(data = smoke_complete,
# map the variables to visual properties of the graph
mapping = aes(
# map the x-axis to age_at_diagnosis
x = age_at_diagnosis,
# map the y-axis to cigarettes_per_day
y = cigarettes_per_day,
# map the color to the disease variable
color = disease
)) +
# add the geometry and the alpha
geom_point(alpha = 0.2) + # complete the geom to geom_point
# add labels to your plot
labs(title = "Age of diagnosis of cancer by daily cigarette consumption", # plot title
x="Age in days", # x-axis label
y="Cigarettes per day", # y-axis label
color = "Disease type") + # label for color legend
# add a facet to your plot
facet_grid(cols = vars(gender)) + # facet by gender variable
# add a theme to your plot
theme_bw()
Things to note: we chain these things together with + (plus sign).
Set the data argument to smoke_complete, then run the code chunk.
Add mapping arguments to the aes function. Assign the variable age_at_diagnosis to x, cigarettes_per_day to y, and disease to color.
Run the code chunk. How does it look different from the prior code chunk?
Add geom_point().
Ok, wow, looking great!! Any major differences you notice between the graph provided before the step by step process in code chunk called full_example above?
In the next modification, modify the alpha argument to vary the opacity of the points. You can vary the value from 0 to 1. Try a few different values to see the difference.
Wait! What little trick was that? See the variable plot_5 in the code chunk above? That is storing your whole code with your nice new labels. Instead of continuing to rewrite your code, you can just modify as needed. Below we are going to facet by gender to separate out male and female to inspect the differences.
Finally we will play a bit with different themes. Pick two of the below themes and then add them to the plot_5 and run the code chunk to compare them.
Theme options:
There are plenty of out of the box themes in ggplot2, but you can make a theme, use other people’s, or even use a theme your organization might already have. Let’s load the `ggthemes’ library.
Rarely is your data going to be in the form you need it to be analyzed and plotted. You will often need to wrangle your data and change the shape of it a bit.
Let’s discuss a bit different ways we might need to reshape data.
Discussion
There are different packages and ways people wrangle data with R, but we’re going to demonstrate using some packages from the tidyverse, which is a whole ecosystem of R packages organized around having tidy data.
From our smoke_complete data, let’s select two columns to keep: gender and days_to_death.
# A tibble: 1,152 x 2
gender days_to_death
<chr> <chr>
1 male 371
2 male 136
3 female 2304
4 male NA
5 female NA
6 male 345
7 male 716
8 male 2803
9 male 973
10 male 1097
# ... with 1,142 more rows
There is a lovely package called magrittr that was loaded earlier, but includes something called a pipe, that looks like this in code: %>%. It allows use to call a nicely pipe data and preform lots of tasks on it.
Run the code below and inspect the output. How does it compare to the output from the prior code chunk?
# A tibble: 1,152 x 2
gender days_to_death
<chr> <chr>
1 male 371
2 male 136
3 female 2304
4 male NA
5 female NA
6 male 345
7 male 716
8 male 2803
9 male 973
10 male 1097
# ... with 1,142 more rows
Now let’s meet filter().
# A tibble: 2 x 20
primary_diagnosis tumor_stage age_at_diagnosis vital_status
<chr> <chr> <dbl> <chr>
1 C34.3 stage ib 19025 dead
2 C34.3 stage ib 19025 dead
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
# state <chr>, tissue_or_organ_of_origin <chr>,
# days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
# days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
# years_smoked <chr>, gender <chr>, year_of_birth <chr>,
# race <chr>, ethnicity <chr>, year_of_death <chr>,
# bcr_patient_barcode <chr>, disease <chr>
The pipe alternative way.
# A tibble: 2 x 20
primary_diagnosis tumor_stage age_at_diagnosis vital_status
<chr> <chr> <dbl> <chr>
1 C34.3 stage ib 19025 dead
2 C34.3 stage ib 19025 dead
# ... with 16 more variables: morphology <chr>, days_to_death <chr>,
# state <chr>, tissue_or_organ_of_origin <chr>,
# days_to_birth <dbl>, site_of_resection_or_biopsy <chr>,
# days_to_last_follow_up <chr>, cigarettes_per_day <dbl>,
# years_smoked <chr>, gender <chr>, year_of_birth <chr>,
# race <chr>, ethnicity <chr>, year_of_death <chr>,
# bcr_patient_barcode <chr>, disease <chr>
Together select() and filter() will be big workhorses in your data wraggling toolkit. However, at first it is easy to confuse which function does what.
select() selects the columns to stay.
filter() filters the rows to keep that meet certain conditions placed on columns. So in the example above, we filtered the data to only keep the rows where the bcr_patient_barcode was equal to “TCGA-18-3412.”
We meet the pipe %>% earlier, but let’s see how we can combine multiple pipes in a row to get a more complicated operation done.
# A tibble: 1,032 x 2
age_at_diagnosis gender
<dbl> <chr>
1 24477 male
2 26615 male
3 28171 female
4 27154 male
5 23370 female
6 19025 male
7 26938 male
8 28430 male
9 30435 male
10 24019 male
# ... with 1,022 more rows
Let’s make age_at_diagnosis more human friendly by creating a new column that divides the days by 365.25. (There are great R packages to handle data, times, duration, intervals, and all those other message time and date issues!)
# A tibble: 1,152 x 2
age_at_diagnosis age_at_diagnosis_years
<dbl> <dbl>
1 24477 67.0
2 26615 72.9
3 28171 77.1
4 27154 74.3
5 23370 64.0
6 19025 52.1
7 26938 73.8
8 28430 77.8
9 30435 83.3
10 24019 65.8
# ... with 1,142 more rows
There are several great packages that can make lovely tables that are publication ready.
One quick example here of our prior code chunk output table_1. We take that same object and let a package called kableExtra work some magic on it!
| age_at_diagnosis | age_at_diagnosis_years |
|---|---|
| 24477 | 67.01437 |
| 26615 | 72.86790 |
| 28171 | 77.12799 |
| 27154 | 74.34360 |
| 23370 | 63.98357 |
| 19025 | 52.08761 |
| 26938 | 73.75222 |
| 28430 | 77.83710 |
| 30435 | 83.32649 |
| 24019 | 65.76044 |
R Markdown documents, like this document, allow you to place text and analysis with the code all in a single document and output the result into different formats such as an html webpage, a pdf, or even a word document.
Let’s quickly knit a document to seee the output. In pane 4, or the file pane, find the document called knit_preview.Rmd and open it. We will see examples of knitting the same .Rmd file to a word document, a pdf, and to html.
Within R you can type ? followed by the name of the function eg ?filter() in the console.
In RStudio, you can look at the help tab in pane 4.
However, in the end your best resource is probably going to be searching online for the issue. From online forms, to blog posts, to twitter threads there is a ton of content out there, but crafting a good search query is a workshop in itself!
ggplot2: Elegant Graphics for Data Analysis
The list is always growing! The trouble isn’t finding good resources, it’s finding time to read them!
This work was made possible by the distill, ggplot, dplyr, and rmarkdown packages.